Introduction

Los Angeles County provides openly available data on all restaurant and market inspections over the past 5 years. Facilities are subject to inspection 1 to 3 times a year, and made public within 1 week of inspection date. The frequency in which restaurants and food markets are inspected depends on the public health risk associated with the food products served or prepared and on the facility’s history of inspection grades. Inspectors deduct points based on violations and health risks, which is turned into a score out of 100. In addition, Los Angeles County data from 2018 on population health is publicly available. Demographic data, such as age and race distribution, socioeconomic data, such as proportion receiving EBT and proportion employed, and health outcomes data, such as proportion with asthma and rates of suicide, are provided for 87 cities within Los Angeles County.

Objective

I am interested in exploring restaurant inspection ratings in LA County. I have a few questions, with the main one being: Are restaurant inspection ratings associated with community health status? Secondary questions include: What are the “safest” and most “dangerous” cities in LA County for eating restaurant food? What restaurant chain is the “safest” to eat at? What restaurant chains should one proceed with caution?

Methods

Reading in and wrangling the data

I used 2 data sets which I merged together for this project. Both are available at data.lacounty.gov to download as CSV (I have also uploaded these datasets to my github repository). The first is a dataset of all LA County restaurant inspections. I added LA County city health data to see if there were any relationships between public health outcomes and local restaurant hygiene. These datasets were merged by city name, and no restaurant was missing its city in the first dataset. City names were briefly inspected to ensure matching would be feasible. As Los Angeles had many sub-cities for which there was health data, I only included the “City of Los Angeles” data to represent the local public health for any restaurant with city listed as Los Angeles. Only restaurants in cities with health data were included in this analysis.

if (!file.exists("LACinspections.csv.zip"))
  download(
    url = "https://github.com/v-yin/PM566-Midterm/blob/main/LACinspections.csv.zip",
    dest = "LACinspections.csv.zip",
    mode="wb"
    )
unzip("LACinspections.csv.zip", exdir="./")
inspect <- read.csv("LACinspections.csv")

if (!file.exists("LAChealth.csv"))
  download(
    url = "https://github.com/v-yin/PM566-Midterm/blob/main/LAChealth.csv",
    dest = "LAChealth.csv",
    mode="wb"
    )
health <- read.csv("LAChealth.csv")

health$GEONAME <- toupper(health$GEONAME)
health$GEONAME <- replace(health$GEONAME, health$GEONAME=="LOS ANGELES, CITY OF", "LOS ANGELES")

resthealth <- merge(
  x = inspect,
  y = health,
  by.x = "FACILITY_CITY",
  by.y = "GEONAME",
  all.x = FALSE,
  all.y = FALSE
)
resthealth <- data.table(resthealth)

Data Exploration

Data were explored utilizing R package ggplot2 to create a histogram of scores. Implausibly low scores were deleted (score less than 50, of which there was 1 value with score 3). Average scores within cities were computed and compared. Restaurant chains were identified through tokenizing words as bigrams, and looking to see most common chain restaurants. The following set of 9 chains were selected: McDonald’s, Jack in the Box, Starbucks, El Pollo Loco, Panda Express, Taco Bell, Del Taco, In N Out, Panera Bread. Average chain inspection scores were computed and compared. Measures of public health were selected: proportion with depression, proportion with obesity, proportion with diabetes. To explore a chain restaurant inspections scores with its surrounding city’s health, average score for a chain restaurant was calculated.

In the future, I hope to use OpenStreetMaps API to gather lattitude and longitude data for each restaurant. For now, I was able to use geocode_zip to gather lattitude and longitude for most zip codes in the dataset. The zip codes’ average inspection score, proportion of diabetes, obesity, and depression were calculated. Heat maps of the averages were created to compare spatially where clean/dirty restaurants are in relation to proportion of the population with Diabetes, obesity, or depression.

# Delete scores less than 50
library(data.table)
resthealth <- resthealth[SCORE>50]
# Find average inspection score by city
city_avg <- resthealth[ , .(
  scoreavg = mean(SCORE) 
), by = "FACILITY_CITY"]
# Find restaurant chains
# resthealth %>% unnest_ngrams(word, FACILITY_NAME, n=2) %>% anti_join(stop_words, by = c("word")) %>% count(word, sort=TRUE) %>% as_tibble() %>% print(n=100)
# Create chain variable
resthealth$FACILITY_NAME <- toupper(resthealth$FACILITY_NAME)
resthealth$CHAIN <- ifelse(grepl("MCDONALD", resthealth$FACILITY_NAME), "McDonald's", ifelse(grepl("JACK IN THE BOX", resthealth$FACILITY_NAME), "Jack in the Box", ifelse(grepl("STARBUCKS", resthealth$FACILITY_NAME), "Starbucks", ifelse(grepl("EL POLLO LOCO", resthealth$FACILITY_NAME), "El Pollo Loco", ifelse(grepl("PANDA EXPRESS", resthealth$FACILITY_NAME), "Panda Express", ifelse(grepl("TACO BELL", resthealth$FACILITY_NAME), "Taco Bell", ifelse(grepl("DEL TACO", resthealth$FACILITY_NAME), "Del Taco", ifelse(grepl("OUT BURGER", resthealth$FACILITY_NAME), "In N Out", ifelse(grepl("PANERA BREAD", resthealth$FACILITY_NAME), "Panera Bread", NA)))))))))
# Find average inspection score by city
chain_avg <- resthealth[ , .(
  scoreavg = mean(SCORE) 
), by = "CHAIN"]
# Clean health outcome data to be numeric
resthealth$Prop_obse <- as.numeric(as.character(resthealth$Prop_obse))
## Warning: NAs introduced by coercion
resthealth$Prop_DM <- as.numeric(as.character(resthealth$Prop_DM))
## Warning: NAs introduced by coercion
# Average score by restaurant
zip_avg <- resthealth[ , (ZipAvg = mean(SCORE, by = "FACILITY_ZIP")), list(FACILITY_NAME, FACILITY_ADDRESS, FACILITY_CITY, FACILITY_STATE, FACILITY_ZIP, Prop_DM, Prop_obse, Prop_depr) ]
zip_avg$FACILITY_ZIP <- as.character(zip_avg$FACILITY_ZIP)
# geocode to get lattitude and longitude by zip code
library(zipcodeR)
zip_cord <- geocode_zip(zip_avg$FACILITY_ZIP)
zip <- merge(
  x = zip_avg,
  y = zip_cord,
  by.x = "FACILITY_ZIP",
  by.y = "zipcode",
  all.x = TRUE,
  all.y= FALSE
)

Preliminary Results

In total, there were 252773 inspections of 39423 restaurants in 66 cities within LA County.

Of the 252773 inspections included in the analysis, the average grade was 93.9351117 with a standard deviation of 3.8669866. The highest score was a perfect score, 100 whereas the lowest score was 54. Interestingly, there appears to be a peak at 90, which corresponds to the lowest possible score to achieve an A rating. This may hint at bias involved in the inspection grading process.

library(ggplot2)
scorehisto <- ggplot(resthealth, aes(x=SCORE)) +
  geom_histogram(binwidth=1, fill="red") +
  ggtitle("Distribution of LA County Restaurant Inspection Scores, 2017-2022") +
  xlab("Inspection Score")
scorehisto

Of the 66 cities included in this analysis, Long Beach had the highest average city inspection score. The city with the worst average score was Monterey Park.

city_avg %>% arrange(desc(scoreavg)) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Top 5 Cities")
Top 5 Cities
City Average Score
LONG BEACH 97.37500
CALABASAS 96.03195
SANTA CLARITA 95.74072
MAYWOOD 95.60426
EAST LOS ANGELES 95.49254
city_avg %>% arrange(scoreavg) %>% slice(1:5) %>% knitr::kable(col.names = c("City", "Average Score"), caption = "Bottom 5 Cities")
Bottom 5 Cities
City Average Score
MONTEREY PARK 91.63380
ROWLAND HEIGHTS 91.68030
ALHAMBRA 92.36034
GARDENA 92.36540
CERRITOS 92.59487

Of the chain restaurants examined, Panda Express had the best average score, whereas Del Taco had the lowest average score. Cities with Starbucks and Panera appeared to have the lowest proportion of diabetes. However, depression was highest in cities with Starbucks and Panda Express. Cities with Del Taco and Panera Bread had the highest amount of obesity.

chain_avg %>% na.omit() %>% arrange(desc(scoreavg)) %>% slice(1:10) %>% knitr::kable(col.names = c("Chain", "Average Score"), caption = "Chain Ranking")
Chain Ranking
Chain Average Score
Panda Express 97.01446
In N Out 96.89239
Starbucks 96.62458
Taco Bell 96.52941
McDonald’s 95.15971
El Pollo Loco 94.92152
Jack in the Box 94.79830
Panera Bread 94.62241
Del Taco 93.16085
resthealth[ , .(
  avgDM = mean(Prop_DM, na.rm = T),
  avgMDD = mean(Prop_depr, na.rm = T),
  avgOB = mean(Prop_obse, na.rm = T)
), by = "CHAIN"] %>% na.omit() %>% as_tibble() %>% knitr::kable(col.names= c("Chain", "Average Proportion Diabetes", "Average Proportion Depression", "Average Proportion Obesity"), caption = "Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant")
Proportion of Diabetes, Depression, and Obesity in surrounding city by chain restaurant
Chain Average Proportion Diabetes Average Proportion Depression Average Proportion Obesity
Starbucks 0.0984833 0.0875348 0.2299143
Taco Bell 0.1034938 0.0831847 0.2472439
McDonald’s 0.1036016 0.0826779 0.2438592
In N Out 0.1005722 0.0750168 0.2412087
Panda Express 0.1010105 0.0841394 0.2398609
Jack in the Box 0.1053523 0.0794227 0.2500741
Panera Bread 0.0976643 0.0812643 0.2606151
El Pollo Loco 0.1034037 0.0799796 0.2445605
Del Taco 0.1051764 0.0806202 0.2659273

Is there a correlation between restaurant inspection score and surrounding city health?

score.pal <- colorNumeric("inferno", domain=zip$V1)
scoremap <- leaflet(zip) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(V1), color = ~score.pal(V1),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=score.pal, values = zip$V1, title= "Inspection Score", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
scoremap
DM.pal <- colorNumeric("viridis", domain=zip$Prop_DM)
DMmap <- leaflet(zip) %>%
    addProviderTiles("CartoDB.Positron") %>%
  addCircles(
    lat = ~lat, lng = ~lng,
    label = ~paste0(V1), color = ~DM.pal(Prop_DM),
    opacity = 1, fillOpacity = 1, radius = 200
  ) %>%
  addLegend("bottomright", pal=DM.pal, values = zip$Prop_DM, title= "Proportion in City with Diabetes", opacity=1) %>% addTiles()
## Warning in validateCoords(lng, lat, funcName): Data contains 124 rows with
## either missing or invalid lat/lon values and will be ignored
DMmap

Conclusion